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Abstract. Given a gene expression data array of a list of bladder cancer patients with 
their tumor states, it may be difficult to determine which genes can operate as dis¬ 
ease markers when the array is large and possibly contains outliers and missing data. 
An additional difficulty is that observations (tumor states) in the regression problem 
are discrete ones. In this article, we solve these problems on concrete data using first 
a clustering approach, followed by Least Absolute Shrinkage and Selection Operator 
(LASSO) estimators in a nonlinear regression problem involving discrete variables, as 
described in the brand-new research work of Plan and Vershynin. Gene markers of the 
most severe tumor state are finally provided using the proposed approach. 

1 Introduction 

In this article, we present a methodology to perform selection among genes based 
on their expression in various groups of patients, in order to find new genetic markers 
for specific pathologies. Our approach is based on clustering the denoised data and 
computing a LASSO (Least Absolute Shrinkage and Selection Operator) estimator, in 
order to select the relevant genes. This latter belongs to the class of penalized regression 
estimators where the penalty is a multiple of the £i-norm of the regression vector. 

We apply the proposed methodology to a set of gene expression data from patients 
with bladder cancer, where four possible subtypes of tumor state are considered thus 
making the observations discrete in the regression problem. Our primary objective in 
the present work is to extract a set of relevant genes for the bladder cancer under study. 
A secondary objective is to emphasize the following fact: although the regression prob¬ 
lem we consider is nonlinear and involves discrete variables, the LASSO can still be 
used if selection is performed for prediction. This is due to the recent work of Plan 
and Vershynin | fT4] |, which merits advertising in applications to biology where much 
of methodology concentrates on the binary logistic model and too often neglects more 
complicated output^ 

The remainder of this article is as follows. The bladder cancer data that have been 
used for illustration purpose in this study are described in the next section. Recalls 
about Gaussian mixture and selection model are provided in Section together with a 

'For instance, the data in our study can be modelled using ordered polytomic regression and needs a 
penalized likelihood estimator. This latter is hard to find in current statistical software libraries whereas 
the LASSO is widely available. 
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principal component analysis (PCA) of the data under study. The generalized LASSO 
of Vershynin and Plan is reealled and applied to our data in Seetion This researeh 
work ends by a conclusion that proposes a gene marker for the last tumor state. 

2 Presentation of the data 

To accurately diagnose bladder cancer is a Public Health priority as for instance, in 
2013, 10,000 new patients were affected by this cancer in France. A promising way to 
improve the diagnosis and to design efficient treatments is to determine which genes are 
responsible for such a cancer. Obviously, sueh medieal treatments must depend on how 
much the tumor is developed. To aehieve this goal, gene expression data together with 
corresponding state of the malignant tumor in the bladder have been recently collected 
from 100 patients in the Lyon region, France 0. 



(a) 2 components (b) 3 components 

Figure 1: Principal component analyses of the denoised data 
In this study, 34 genes of interest have been chosen, while the tumor state has been 

decomposed in 4 classes, namely: 

• Ta : noninfiltrating tumor in Urothelium; 

• Tla : noninfiltrating tumor in Urothelium and parts of the ehorion; 

• Tib : noninfiltrating tumor in Urothelium and the full chorion; 

• > Tl : infiltrating tumor. 

Remark that, in the standard classification, the last group of the list above incorporates 
states from T2 to TAh. Mathematieally speaking, the data are thus eonstituted by this 
first column providing the tumor state (discrete), followed by 34 other columns that 
quantify the expression of the selected genes. Each row of the 100 x 35 matrix is 
associated to one of the 100 given patients, and the objective is to determine, for eaeh 
tumor state, which gene(s) must be selected as the best marker eandidate(s). However, 
due to its experimental origin, this raw array eontains outliers and eorrupted data. 

To extract the relevant features in a given dataset is a difficult task, recently re¬ 
solved in the non-negative data case with the Non-negative Matrix factorization (NMF) 
method. The objective of our previous researeh work |[^ was to extend this method 
to the ease of missing and/or corrupted data due to outliers. To do so, data have been 
denoised, missing values have been imputed, and outliers have been detected while 
performing a low-rank non-negative matrix factorization of the recovered matrix. To 
achieve this goal, a mixture of Bregman proximal methods and of the Augmented La- 
grangian scheme have been used on our dataset in |[^, in a similar way to the so-called 
Alternating Direetion of Multipliers method. In what follows, we thus deal with two 
arrays: the raw one and the denoised one. A principal component analysis of this latter 
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is provided in Figure Next stages consist of determining if these denoised data can 
be clusterized well (which is not the case for the raw data), if the optimal number of 
clusters corresponds well to the number of tumor states, and if each cluster is coherent, 
that is, if each patient being in tumor state k is too in cluster number k. 

3 Clustering of the gene expression data 

3.1 Gaussian mixtures 

Finite Gaussian mixture models (GMM) are widely used in a great number of appli¬ 
cation fields as a mean to perform model based classification. From pattern recognition 
to biology, from quality control to finance, many examples have shown the pertinence 
of the Gaussian mixture model approach pT] |. In GMM data 1),..., are assumed 
independent and identically distributed (i.i.d.) and to be drawn from the density: 

K 

( 1 ) 

k=i 


where 


^ ^=exp(-(2) 

^(27r)<^det(S=^)) ^ ^ 

and where the vector 9* = {pi,... pi,..., E];,..., E^) is an unknown mul¬ 

tidimensional parameter. To this model, we traditionally associate an extended model 
using the notion of complete data. In mixture models, the complete data are i.i.d. cou¬ 
ples of the form (1^, Z*), where Z, is a multinomial random variable taking values in 
{1,... ,K} with P{Zi = k) = pI. This latter represents the index of the mixture com¬ 
ponent from which observation i was drawn. We assume that, conditionally on the event 
Zi = k,Yi has density 


1 

V(2>r)“det(sa “P 




(3) 


The variables Zi,..., Z^^ being unobserved, they are usually called “latent variables”. 
The standard approach for estimating 9* is the maximum likelihood methodology that 
consists of finding 9 which maximizes the log-likelihood function 


n K 

l{9) = J]log ( (4) 

i=\ k=l 


over the set 


K 

0 = |(pi,...,Pi^,/ii,...,)Ui^,Ei,...,EK) I Pk G G G and '^Pk = l} 

k=l 


where Sj" denotes the set of all symmetric positive semidefinite matrices and M+ is the 
set of nonnegative real numbers. The usual way to maximize the log-likelihood is the so- 
called EM algorithm ||^[^, or its more efficient componentwise variants, e.g., plfTO). 


3.2 Results 

The problem of choosing the number of clusters K a priori is a difficult one. This 
is usually done by comparing the penalized maximum likelihood values for different 
values of K and choosing the maximum one. Model selection can be performed using 
the Bayesian Information Criterion (BIC). This criterion is the opposite of the maximum 
likelihood value penalized with log(n) x the number of real parameters to estimate. 
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(a) Using all components (b) On the 3 principal components 

Figure 2: Determination of the optimal number of clusters in denoized data: number of clusters for easting 
values and (log of) Bayesian Information Criterion (BIC) for northing ones. 

The first attempt on raw data failed to provide any useful information, due to outliers 
and missing data. This criterion has then been applied on the gene expression part of 
our denoised array, to determine the best way to cluster the set of genes. The number of 
mixture components has ranged from 1 to 29, and at each time the Bayesian information 
criterion for the current model fit has been computed (more precisely, for pretty prints, 
the logarithm of a: — minBIC, where minBIC is the smallest obtained BIC). As can 
be seen in obtained plot depicted in Figure]^ the criterion has not provide any obvious 
result when considering the whole data. However, applying it on the 3 principal com¬ 
ponents shown in Figure [T] emphasizes that the optimal number of clusters is 4. Such a 
result were encouraging, as we have 4 tumor states in the array. We then have performed 
a PCA on the raw data while colorizing each of the 4 clusters provided by the Gaussian 
mixture model. Obtained results are depicted in Figure they are coherent with the 
tumor state of each patient. Note that, at this stage, each cluster is a Gaussian one. 



Figure 3: PCA on raw data, colorized according to their cluster provided by the GMM. 


4 Generalized LASSO of Vershynin and Plan 

4.1 Background 

In our data set, we have to explain the tumor state based on the expression of certain 
genes. Since the state is a discrete variable, an appropriate model should be the multi¬ 
nomial logistic model. This model could be refined and a more appropriate one could 
be the ordered polytomic regression model. The main difficulty in this kind of model is 
that when the number of observations is of the same order as the number of covariates, 
the maximum likelihood estimator might not perform well. Moreover, selection of some 
of the most relevant covariates might be an important way to extract meaningful infor¬ 
mation from the data. Performing variable selection can be done using the Bayesian 
Information Criterion that we presented in Section |3.I[ However, if one wants to try 


all models built on the expression of less than, e.g., 15 genes, the computational effort 
might be overwhelming in many gene expression studies. 
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In order to overcome this issue, an important contribution was made by Tibshi- 
rani 0. The analysis was then extended to the case of even more covariates than obser¬ 
vations in Q, while extension to Generalized Linear Models was then proposed, see Q 
for a very useful and mathematically deep reference. The LASSO can be described as 
the solution of the following penalized least squares problem 

min hy - Xb\\l + X\\b\\i, (5) 

OGKP Z 

where ||6||i is the fi-norm of b, i.e., ||6||i = J2j=i=p\b \ V vector containing 
the n outputs (the tumor state), X is the matrix whose columns are the expression of 
each of the p genes in the n patients, while A is the relaxation parameter. 

Under certain properties of the matrix X, the LASSO estimator enjoys good variable 
recovery properties |[^ Thm 1.4]. If the matrix has very correlated columns, variable 
recovery will generally fail but under a mixture model, good prediction bounds can still 
be obtained as proven in Q. Since in our study, we suspect that high correlations exist 
between the genes under study, we cannot expect to obtain good variable recovery with 
the LASSO. However, we can still believe that the variable selection performed with 
the LASSO is relevant with respect to prediction. Various theoretical values for the 
relaxation parameter A have been proposed in the literature, see, e.g., 0 or 0 


4.2 The generalized LASSO 

An important point to address in our study is the discrete nature of the output vector 
y. However, as stated previously, extending the analysis of the LASSO to the ordered 
polytomic model seems quite difficult and cumbersome to obtain, while useful and ro¬ 
bust software is currently not available. One very interesting question is whether the 
LASSO can still be applied in the nonlinear context where the output (tumor state) is 
ordered and discrete as in our problem? Fortunately, the answer is yes, as it was re¬ 
cently proven by Plan and Vershynin [ fT4| |. Moreover, their analysis applies to more 
general penalization terms than just the norm. One important assumption in | fT4| | 
is that the design, i.e., X, is Gaussian. Therefore, an important step before using this 
approach is that the data set must be clustered into Gaussian clusters, as it has been 
achieved via the mixture model in Section |TT| Then the LASSO should be performed 
clusterwise. In what follows, we only focus on “black” cluster (> Tl tumor). 

The choice of the relaxation parameter A is still an important and difficult problem in 
practice. In the sequel, we restrict our attention to the results obtained using the LARS 
method, which consists of computing the estimator for a continuous set of values of A, 
and we analyze the obtained trajectory in order to select the most important genes. 


5 Conclusion on experimental results 

Eight genes still remain in the Gaussian infiltrating tumor cluster under consider¬ 
ation, corresponding in our array to columns: 1 and 2 (genes ATF3 and Bell likeM 
respectively), 11 (HMGB2), 20 (MMPll), 21 (ORC6L), 25 (RAD54L), 29 (TK), and 34 
(Vimentin). A value of eq. @ ranged between 0.1 and 0.03 with a step of 0.001, and for 
each A the penalized least square problem has been solved. Each resolution has led to 
the associated parameter vector 6 G E®, where the k-th coordinate of b corresponds to 
the k-th gene of the considered cluster, according to its column occurrence in the raw 
data array. We thus have plotted the LASSO result trajectories, putting A in the abscissa 
axis and each coordinate of b in the ordinate axis, see Figure]^ 

As can be seen, HMGB2 (column number 11) obviously steps out of line, appearing 
as the most important gene reflecting the infiltrating tumor cluster. This is not surpris¬ 
ing, as the high mobility group box 2 (HMGB2) overexpression has been observed in 
several human tumor types, and is involved in cancer progression and prognosis, espe¬ 
cially in the bladder one p^[T3]|. Generally speaking, all genes have a positive impact. 
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Figure 4: Obtained trajectory in the Lars method on infiltrating tumor Gaussian cluster 
except the second most important gene, namely Bcl2 likel4 (2), which has an effect 
diametrically opposed to the other ones. This can be explained by the well known fact 
that overexpression of this gene, a candidate tumor suppressor [|T|, induces apoptosis in 
cells. Finally, by order of importance, the two next genes are respectively ATF3 (1) and 
MMPll (20), even though the small value of A may blur the information raised by these 
trajectories. HMGB2 and these latter can thus act as genetic markers of > T1 tumors. 

This work was partially funded by Elements Transposables Franche-Comte grant. 
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